Replication materials for "Messy Data, Robust Inference? Navigating Obstacles to Inference with bigKRLS"
By: Pete Mohanty (pmohanty@stanford.edu) and Robert Shaffer (rbshaffer@utexas.edu)

------

See the individual folders for the materials relevant to each figure or paper section. Because raw bigKRLS model outputs consume a substantial amount of hard drive space (~600 MB for the main results in our paper), we include a fitted bigKRLS object only for the applied model presented in Section 5 of our paper, not for the hundreds of simulations or cross-validation replicates. In all other cases, we include code, summaries, and partial model outputs that should be sufficient to reproduce the figures presented in our paper and appendices. See below for details on the contents of each folder, and feel free to reach out to the authors with questions.

IMPORTANT: most scripts will not run without changing several paths at the beginning of each script. Be sure to modify paths to match your system organization before running project code!

-------------
Folder list:
-------------

Figure_1 ("Actual" marginal effects plot):

 - Contains:
   - R code (Figure_1/sinfx.R) used to generate the sample marginal effects plot given in Figure 1 (sincurve.pdf).

 - To run:
   - See Figure_1/sinfx.R to generate Figure 1.

Figure_2 (journal sample size plot):

 - Contains:
   - Raw journal article sample size file (Figure_2/journal_articles.csv)
   - Code used to generate the sample size figure (Figure_2/journal_articles.R)

 - To run:
   - See Figure_2/journal_articles.R to generate Figure 2.

Figure_3 (runtime comparison): Code and original output are found in the paper. When comparing runtime across versions, take care with the eigentruncation options. The figures given in-text use eigtrunc = 0, which differs from the default bigKRLS setting.

Table_2 (eigentruncation): Demonstrates runtime and numeric convergence for the various hyperparameters discussed in the paper. Note that the options differ slightly between bigKRLS and KRLS, so care is needed when establishing convergence.

2016_election_ests: Contains the fitted bigKRLS model object used to generate the estimates presented in-text. Results can be accessed using the load.bigKRLS() function, but the stored files should not be viewed or modified directly.
 
2016_election_application:

 - Contains:
   - Raw data files (2016_election_application/2016_election_data/datasets) and Python code (2016_election_application/2016_election_data/create_df.py) used to generate the 2016 election dataset used in Section 5 and Appendices C-E.
   - Final 2016 election dataset (2016_election_application/2016_election_dataset.csv) generated using the raw data files and Python code mentioned above. (Contains all data used in Section 5 and Appendix E but not fully formatted for regression.)
   - R code used to format the data passed to bigKRLS, as discussed in Section 5 and Appendix E. Specifically, 2016_election_application/building_2016_data_frame.R generates X_2016.csv and y_gop_2016_delta.csv, which are ready for bigKRLS(). analyzing_2016.R estimates the model and creates the subsequent tables and graphics.

 - To run:
   - Run 2016_election_application/2016_election_data/create_df.py (using Python 2) to generate the election dataset. Be sure to change paths to ensure that the script will run.
   - Run 2016_election_application/analyzing_2016_data.R to produce the bigKRLS model analyzed in Section 5. This script also contains code used to generate the figures in Section 5 and Appendix E (spatial first differences). 
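
   One way to handle the path changes for the Python step is to derive every location from a single root folder. The following is only a minimal sketch: REPLICATION_ROOT, election_paths(), and missing_paths() are hypothetical names that do not come from create_df.py, and the root location is an assumption you must replace with your own. The folder and file names are the ones listed in this README. The syntax is kept compatible with Python 2.

```python
import os

# Hypothetical: set this to wherever you unpacked the replication archive.
REPLICATION_ROOT = os.path.expanduser("~/replication_materials")


def election_paths(root):
    """Build the locations named in this README relative to one root folder."""
    app = os.path.join(root, "2016_election_application")
    return {
        # Input: raw data files read when building the election dataset.
        "raw_data": os.path.join(app, "2016_election_data", "datasets"),
        # Output: final election dataset produced from the raw files.
        "final_csv": os.path.join(app, "2016_election_dataset.csv"),
    }


def missing_paths(paths):
    """Return the names of entries whose location does not exist on disk."""
    return sorted(name for name, p in paths.items() if not os.path.exists(p))


if __name__ == "__main__":
    paths = election_paths(REPLICATION_ROOT)
    # Only the raw data folder must exist before the dataset is generated.
    for name in missing_paths({"raw_data": paths["raw_data"]}):
        print("Missing: %s -> %s" % (name, paths[name]))
```

   If the check prints nothing, the raw data folder was found under the chosen root. The same idea (one root variable, all other paths built from it) carries over to the R scripts.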

Appendix_C (degrees-of-freedom and subsampling simulations):

 - Contains:
   - Code used to generate the subset of the 2016 election dataset used for simulations in Appendix C
   - Code used to generate the degrees-of-freedom simulations in Appendix C.1-C.2
   - Code used to generate the subsampling simulations in Appendix C.3

 - To run:
   - Run Appendix_C/create_data_small.R to generate Xsmall.RData (predictor variable matrix for Appendix C)
   - Run Appendix_C/Appendix_C.1-2/sims_pops_coverage.R to run the simulations described in Appendix C.2, and run Appendix_C/Appendix_C.1-2/visualizations.R to evaluate and create visualizations contained in that appendix.
   - Run Appendix_C/Appendix_C.3/sims_subsampling.R to run the simulations described in Appendix C.3, and run Appendix_C/Appendix_C.3/visualizations.R to evaluate and create visualizations contained in that appendix.


Appendix_D (cross-validation):

 - Contains:
   - Code used to generate the reduced-dataset cross-validation results given in Appendix D.1 (Appendix_D/Appendix_D.1)
   - Code used to generate the full-dataset cross-validation results given in Appendix D.1 (Appendix_D/Appendix_D.2)
   - Main results, found in Appendix_D/Appendix_D.1/cv_results.csv, Appendix_D/Appendix_D.2/kcv_krls/kcv_seeds_1_to_100.csv, and Appendix_D/Appendix_D.2/kcv_non_krls/kcv_results*.RData

 - To run:
   - Run Appendix_D/Appendix_D.1/cv_election2016_D1.R to generate the cross-validation results in Appendix D.1
   - Run Appendix_D/Appendix_D.2/kcv_non_krls/crossvalidate_2016.R to generate the underlying data for the non-KRLS entries in the cross-validation table in Appendix D.1. To accumulate results, run Appendix_D/Appendix_D.2/kcv_non_krls/evaluate_crossvalidation.R


